Hugging Face Datasets Plugin #1116
Codecov Report
@@ Coverage Diff @@
## master #1116 +/- ##
==========================================
+ Coverage 68.38% 68.51% +0.12%
==========================================
Files 288 288
Lines 25963 26095 +132
Branches 2899 2920 +21
==========================================
+ Hits 17756 17880 +124
- Misses 7728 7736 +8
Partials 479 479
@esadler-hbo you rock! Love all 3 integration goals. @esadler-hbo & @samhita-alla would you folks be open to writing a blog?
plugins/flytekit-huggingface/flytekitplugins/huggingface/sd_transformers.py
@esadler-hbo let me take a look at this and try to fix some of the handling around protocol
@wild-endeavor amazing! I’ll get a chance to work on this more this weekend.
Thanks! Yeah I really need to get to those changes I was talking about today.
Tagged you also on the other PR @easadler-hbo
3dc8344 to e242ff7
@esadler-hbo Thanks. Overall LGTM, with some minor comments below.
TL;DR
Hugging Face provides great packages to make working with state-of-the-art language models easy. Integrating with Flyte would connect ETL to the training and inference of deep learning models seamlessly.
Type
Are all requirements met?
Complete description
You can use Hugging Face to create high-quality embeddings, which are becoming really valuable to a lot of companies. Flyte could elegantly handle the different infrastructure considerations. Notice there is no model training, which makes this workflow especially great.
The first integration is adding Hugging Face's datasets to Flyte's StructuredDatasets. Their datasets library is a very performant way to pass data into neural networks. It is based on tf.data.Dataset, but uses Arrow instead of TFRecords. I am excited by the idea of having an ETL job output a pyspark.sql.DataFrame and then doing batch training and batch inference with a Hugging Face dataset seamlessly.
The second integration would be coming up with a highly scalable task for step 2 in the following workflow:
I have heard from @gdj0nes that this is a common workflow with infrastructure pain points.
Finally, Hugging Face has a platform where you can save datasets, models, and deploy ML applications. There are opportunities to integrate with their platform that should be mentioned, but are lower priority.
Tracking Issue
https://github.com/flyteorg/flyte/issues/
Follow-up issue
NA
OR
https://github.com/flyteorg/flyte/issues/